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Abstract 

Background: Massively-parallel cDNA sequencing (RNA-Seq) is a new technique that holds great promise for 
cardiovascular genomics. Here, we used RNA-Seq to study the transcriptomes of matched coronary artery disease 
cases and controls in the ClinSeq® study, using cell lines as tissue surrogates. 

Results: Lymphoblastoid cell lines (LCLs) from 16 cases and controls representing phenotypic extremes for 
coronary calcification were cultured and analyzed using RNA-Seq. All cell lines were then independently re-cultured 
and along with another set of 16 independent cases and controls, were profiled with Affymetrix microarrays to 
perform a technical validation of the RNA-Seq results. Statistically significant changes (p<0.05) were detected in 
186 transcripts, many of which are expressed at extremely low levels (5-10 copies/cell), which we confirmed 
through a separate spike-in control RNA-Seq experiment. Next, by fitting a linear model to exon-level RNA-Seq read 
counts, we detected signals of alternative splicing in 18 transcripts. Finally, we used the RNA-Seq data to identify 
differential expression (p < 0.0001) in eight previously unannotated regions that may represent novel transcripts. Overall, 
differentially expressed genes showed strong enrichment (p = 0.0002) for prior association with cardiovascular disease. 
At the network level, we found evidence for perturbation in pathways involving both cardiovascular system 
development and function as well as lipid metabolism. 

Conclusions: We present a pilot study for transcriptome involvement in coronary artery calcification and 
demonstrate how RNA-Seq analyses using LCLs as a tissue surrogate may yield fruitful results in a clinical sequencing 
project. In addition to canonical gene expression, we present candidate variants from alternative splicing and novel 
transcript detection, which have been unexplored in the context of this disease. 
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Background 

Coronary Artery Disease (CAD) is a leading cause of 
mortality worldwide. The etiology of CAD is influenced 
by lifestyle, genetics, and gut microbiome [1,2] although 
the balance of these contributors remains unclear. While 
numerous GWAS projects have explored the genetics of 
CAD susceptibility [3], comparatively less is known 
about gene expression changes in this disease [4], This is 
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partly because CAD is a complex phenotype involving 
multiple physiological systems. Consequently, no single 
tissue may fully reflect the network of gene expression 
changes underlying the disease [5]. 

In addition, current knowledge of the CAD transcriptome 
is based on gene expression microarrays, a technology that 
is useful but has several limitations. Recently, RNA-Seq, a 
powerful new technique for transcriptome analysis has rev- 
olutionized gene expression analyses by providing the abil- 
ity to simultaneously interrogate all transcripts in an RNA 
sample (unlike microarrays which are limited to previously 
annotated transcripts) [6]. In addition, the high resolution 
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of RNA-Seq can facilitate discoveries of transcript dys- 
regulation that have been missed by previous technolo- 
gies. Here, we present results from a pilot application of 
RNA-Seq on human cases and controls chosen to reflect 
extremes for coronary artery calcification (CAC), a clin- 
ical marker for advanced CAD that is highly correlated 
with future adverse cardiovascular events [7]. As a pa- 
tient surrogate for gene expression, we used Epstein- 
Barr virus transformed lymphoblastoid cell lines (LCLs), 
which have been shown in multiple studies to reliably 
reflect gene expression signatures [8], particularly those 
associated with nearby ds-acting genomic polymorphisms 
(expression quantitative trait loci or eQTLs) [9]. 

Here, we used RNA-Seq to compare LCLs from matched 
CAC cases and controls in the ClinSeq® project [10], and 
report a set of candidate genes for CAD that show dif- 
ferential expression and splicing variation between the 
groups, as well as novel transcripts that may have rele- 
vance for disease. RNA-Seq results were experimentally 
and technically replicated by repeating the cell culture 
part of our experiments and then measuring gene ex- 
pression independently using expression microarrays. 
At the network level, we demonstrate that these dif- 
ferentially expressed genes are enriched for prior CAD 
association as well as for biological pathways with direct 
relevance to atherosclerosis. 

Methods 

An expanded Methods section is provided in Additional 
file 1. 

Coronary calcification scoring 

CAC scoring was performed using multi-slice computed 
tomography (Aquilion One, Toshiba Medical Systems, 
Japan or Lightspeed VCT, General Electric Healthcare, 
Waukesha, Wisconsin) with the use of prospective elec- 
trocardiographic gating. Calcification was quantified using 
the Agatston scoring method [11] on a dedicated work- 
station (VitreaFX or GE Advantage). The study was 
approved by the NHGRI Institutional Review Board and 
all subjects gave written informed consent. 

Cell culture, RNA-Seq library preparation, and sequencing 

LCLs were established by Epstein-Barr Virus (EBV) in- 
oculation of B-lymphocytes using standard procedures, 
grown to a density of 2-3 x 10 5 cells/ml, harvested 
when the culture reached 10 7 cells in total (5-6 pas- 
sages) and stored at -80°C. Total RNA from 4 x 10 6 
cells was extracted using the RNeasy Mini kit (Qiagen). 
RNA integrity was assessed using an Agilent Bioanalyzer 
(Agilent). RNA-Seq libraries were constructed using a 
custom protocol (see Additional file 1). Sequencing was 
done on Illumina GAII X sequencers (Illumina), collecting 
two lanes of 51 bp reads for each library. 



RNA-Seq data processing 

Reads were aligned to the human genome (hgl8) using 
TopHat [12]. The TopHat output SAM file was con- 
verted to BED format using the bamToBed script from 
the BEDTools package [13]. The coverageBed script 
from BEDTools was used to count reads mapping to 
individual exons in the RefSeq database (see Additional 
file 1). A total count for each transcript in this database 
was obtained by adding the counts for its constituent 
exons. These counts, after filtering and normalization 
(see Additional file 1), were used for statistical analysis 
of differential expression. We confirmed the subject 
identity in all RNA-Seq libraries by genotyping a subset 
of expressed SNPs from the corresponding genomic 
DNA. This indicated some contamination in the data for 
subject 133871; hence, this subject was excluded from 
further RNA-Seq analyses. The transcript-level RNA-Seq 
counts were first analyzed using a two level, one-way 
ANOVA to compare the expression in the case and con- 
trol groups. Concurrently, the count data were analyzed 
using the edgeR Bioconductor package with default set- 
tings using the moderated tagwise dispersion option and 
the prior. N parameter set to 4. 

Affymetrix exon array experiment 

For each sample, two micrograms of total RNA were 
used in conjunction with the Whole-Transcript Expres- 
sion Analysis protocol for Affymetrix Human Exon 1.0 
ST arrays. Arrays were scanned using an Affymetrix 
Gene Chip Scanner 3000 and intensities were calculated 
using Affymetrix AGCC software. RMA (Robust Multi- 
chip Average) signal intensities were calculated using Affy- 
metrix Expression Console, converted to arithmetic scale, 
and transformed using an adaptive variance-stabilizing, 
quantile-normalizing transformation that was scaled to 
match the transformation used on the RNA-Seq data. A 
two level, one-way ANOVA was run on the microarray 
expression results. 

Western blot analysis of IGLL5 

A small piece of human carotid atherosclerotic plaque 
was frozen in liquid nitrogen and crushed to fine powder 
using an alloy tool steel mortar and pestle set that was 
pre-cooled in liquid nitrogen. The tissue powder was 
transferred to an ice-cold microcentrifuge tube and ice- 
cold cell lysis buffer (RayBiotech #0103004-L) containing 
protease inhibitor cocktail (Sigma P8430) was added; the 
tissue weight to buffer volume ratio was 0.1 g/0.3 ml. 
The samples were vortexed for 20 seconds at 10-minute 
intervals for 30 mins (and were kept on ice when not be- 
ing vortexed). Samples were then centrifuged at 14000 g 
for 15 minutes at 4°C; the supernatant was collected in 
new microcentrifuge tubes. Protein concentrations in 
the supernatant were quantified by using Micro BCA 
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Protein Assay Kit (Pierce #23225) according to manufac- 
turers instructions. 50ug of protein from each sample was 
separated on a 4 - 20% polyacrylamide gel (Bio-Rad, #161- 
1159) withTris-glycine - SDS buffer (Bio-Rad, #161-0732). 
The proteins were transferred onto Immuno-Blot PVDF 
membranes (Bio-Rad, #162-0174). The membrane was 
blocked with 5% non-fat dry milk in Tris-buffered saline 
Tween 20 (0.5%) for 1 hour at room temperature, then 
incubated with primary antibody (anti human IGLL5, 
Abgent #AP 18459b) at a concentration of 1:200 over- 
night at 4°C. The membranes were washed with Tris- 
buffered saline Tween 20 (0.5%) three times, incubated 
with a secondary peroxidase - linked antibody (anti 
rabbit IgG HRP conjugate, Bio-Rad #170-6515) for two 
hours at room temperature and then washed three 
times with Tris-buffered saline Tween 20 (0.5%). The re- 
active bands were visualized by a chemiluminescence 
kit (Bio-Rad, #170-5040) on Kodak BioMax Light film. 
The intensity of each IGLL5 band was analyzed by using 
the Chemidoc XRS system with Western blot analysis 
software (Bio-Rad). The final intensity was adjusted by in- 
dividual loading control intensity readings (beta-actin). 

Detection of novel transcripts 

The Ensembl, AceView, ccDs, knownGenes, refGene, 
tRNA and rnaGene annotation tables were downloaded 
in GTF format from the UCSC Genome Browser (http:// 
www.genome.ucsc.edu) and combined into one master 
annotation file using the Cuffmerge tool in Cufflinks 
[14]. Transcripts were assembled using RNA-Seq data 
from a separate LCL library with extremely high depth 
of sequencing (-200 million read pairs) using Cufflinks 
with the -M option to mask transcripts in the combined 
annotation file. These putative novel transcripts were 
then used as a reference annotation to run Cuffdiff, with 
the same BAM files from the case and control subjects that 
were used for quantifying gene expression. Loci detected 
as significant after Benjamini-Hochberg multiple testing 
correction were visually inspected by viewing the raw se- 
quencing data (in BAM and BigWig format) on the UCSC 
Genome Browser with all gene and gene prediction tracks 
(and the ENCODE gene regulation track) turned on. 

Detection of alternatively spliced transcripts 

For alternative splicing, the exon-level counts were fit to 
a linear mixed-effect three-factor ANOVA model con- 
taining the following terms: 

Jijk =V+Ai + fi m + C + AC ik + e ijk 

where y was the normalized read count for an exon, A t 
was the fixed treatment effect for 1 through i treatments 
(in this case, the case or control status), /5 y ^ was the 



random sample effect for sample ; within treatment, 
C k was the fixed exon effect for 1 through k exons 
within a transcript, AC ik was the fixed interaction "treat- 
ment X exon" effect and e was the error factor. The 
ANOVA p-value for p-AC (which indicates the strength 
of the exon-treatment interaction) was then used to 
select for exons showing significantly different usage 
between cases and controls. Independently, the cuffdiff 
algorithm [14] was also used to detect alternatively 
spliced transcripts. 

Results 

Assessment of CAD burden and RNA-Seq experimental 
design 

To quantify CAD status, study subjects were assessed at 
enrolment for CAC, using multi-slice computed tomog- 
raphy followed by Agatston scoring (see Methods). CAC 
scoring is a robust marker of CAD [15] and has been 
demonstrated to be useful for both calibration and dis- 
crimination of the disease burden. In addition, this 
measurement of CAD also has superior positive pre- 
dictive value for future adverse cardiovascular events 
[16,17]. We selected eight age-, sex-, and ethnicity- 
matched casexontrol pairs from the extremes of the cor- 
onary calcium score distribution for combined RNA-Seq 
and microarray analysis and another eight matched case: 
control pairs for microarray- only analysis (Table 1). The 
median CAC scores for cases in the first and second 
groups were 1531.5 and 682.5, respectively. For compari- 
son, even the presence of a CAC score (i.e., any non-zero 
value) is clinically considered indicative of CAD, while a 
score of 400 is often considered an advanced disease state 
[17]. When the median age for the cases in the first and 
second groups (56 and 61.5 years, respectively) and the 
ethnicity of these subjects (Caucasian) was considered, 
these scores corresponded to the 99 th and 93 rd centiles, 
respectively, as measured using the CAC score distribu- 
tion from 6110 participants in the Multi-Ethnic Study 
of Atherosclerosis (MESA) [18]. This demonstrated the 
severity of CAD in our discovery cases. As part of the 
ClinSeq® protocol, all subjects were also analyzed using 
a set of 123 clinical chemistry tests and six phenotypic 
measurements (Additional file 2: Table SI). Results 
from these tests did not show significant association 
with calcification scores. 

Analyses of gene expression with RNA-Seq data 

To convert the RNA-Seq read data into a quantitative 
measure of gene expression, we calculated the number 
of RNA-Seq reads mapping to transcripts in the RefSeq 
database (see Methods). Such transcript-level read counts 
are better suited for statistical testing of gene expression 
differences compared to other RNA-Seq expression 
metrics [19]. Several statistical techniques have been 
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Table 1 Clinical data for 32 subjects 



Controls 



Cases 



ID 


age 


Calcification score 


ID 


age 


Calcification score 


Discovery cohort (RNA-Seq + Microarray) 










112916 


56 


0 


163383 


55 


1114 


116099 


62 


0 


108971 


63 


1370 


119657 


53 


0 


198564 


53 


1012 


133237 


53 


0 


121424 


52 


514 


113235 


53 


0 


165583 


53 


4352 


118655 


58 


0 


189793 


57 


3532 


119522 


65 


0 


158592 


66 


2885 


133871 


56 


0 


132306 


59 


1693 


Validation cohort (Microarray) 










199039 


64 


0 


129854 


64 


1268 


124644 


58 


0 


106945 


58 


890 


199794 


58 


0 


116106 


58 


697 


102488 


63 


0 


108828 


64 


693 


184917 


57 


0 


155489 


57 


672 


154405 


60 


0 


148238 


61 


668 


117532 


62 


0 


142361 


63 


654 


192736 


62 


0 


107143 


62 


565 



All subjects were males of Caucasian ethnicity. 

developed for quantifying differential expression from 
RNA-Seq count data. To reduce artifacts arising from 
an arbitrary choice of any single technique, we used a 
dual analysis approach. First, we analyzed the count 
data with the Bioconductor package edgeR [20], which 
uses the negative binomial distribution to model the 
count data. Using the common dispersion setting in 
edgeR, we found 978 transcripts to be differentially 
expressed transcripts (p < 0.05), of which 55 were 
significant after applying a 5% Benjamini-Hochberg 
correction for multiple testing (Figure 1A). However, 
edgeR and other RNA-Seq analysis tools using the 
negative binomial distribution have high rates of false 



discovery [21]. To reduce artifacts arising from an 
arbitrary choice of any single technique, we applied a 
second, more conservative test (one-way ANOVA) 
(Figure IB) in addition to edgeR, which returned 515 
transcripts at p < 0.05 without correction for multiple 
testing. Overall, results from both techniques showed 
good agreement, although edgeR was more liberal than 
the ANOVA test in assigning low significance values. To 
define a gene list, we selected transcripts that were signifi- 
cant at p < 0.05 in both the edgeR and ANOVA results, 
retaining 186 transcripts from 116 genes (Additional file 2: 
Table S2). We noted here that neither edgeR nor ANOVA 
tests were designed for isoform-level quantification; hence, 



0.1 

0.01 
0.001- 
0.0001 
0.00001 
0.000001 - 
0.0000001 ■ 
1e-8 
1e-9 
1e-10 





-1 -0.5 



Log 10 fold change (edgeR) Log 10 fold change (ANOVA) 

Figure 1 Volcano plots from edgeR (panel A) and ANOVA (panel B) analyses of RNA-Seq count data. X-axis on both panels shows base 10 
logarithm of fold change (case/control). Y axis shows p value. Red dots indicate 186 transcripts meeting p < 0.05 in both tests. 
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for genes with multiple transcripts in this list of 186, 
we could not identify which isoform was differentially 
expressed. 

An interesting feature of the RNA-Seq results was that 
many transcripts expressed at very low levels showed 
significant differences between cases and controls. As 
genes expressed at near-background levels have historic- 
ally been problematic for microarray technology [22], 
this aspect of the CAD transcriptome may benefit 
greatly from the increased dynamic range of RNA-Seq. 
To quantify absolute expression of these rare tran- 
scripts, we conducted a separate RNA-Seq experiment, 
starting with the same amount of total RNA used for 
other experiments in this study, with the difference be- 
ing that the number of cells used for RNA extraction 
was precisely quantified. To the input RNA, we added 
seven polyadenylated non-mammalian RNA spikes, quan- 
tified such that their abundance (measured as copies per 
cell) spanned five orders of magnitude (Additional file 2: 
Table S3). Next, after sequencing this library to the same 
depth as the others (-40 million reads), we converted 
RNA-Seq reads counts of these spikes to RPKM (Reads 
Per Million per Kilobase) values [19]. Finally, by compar- 
ing RPKM values of differentially expressed transcripts in 
our study with the spike-in control experiment, we deter- 
mined that the cellular abundance of the rarer transcripts 
in our data approaches ~5 copies per cell (Figure 2). 
As many of the candidate genes with the highest statis- 
tical significance belong to this low-expression cat- 
egory (including LOC100131347 which had the lowest 
p value in our ANOVA results), some observations are 
relevant to note here. Firstly, before statistical testing, 
we screened out transcripts which had minimal or no 
coverage (see Additional file 1). Hence, these results 
are unlikely to represent statistical artifacts arising 
from low RNA-Seq coverage. Regarding quantity of 
input material, ten micrograms of total RNA were used 
for all RNA-Seq experiments, which is far above the mini- 
mum requirement for the Illumina RNA-Seq protocol. 
Consequently, paucity of input RNA should not cause 
transcripts to appear "rare" if they are otherwise expressed 
abundantly. Finally, each sample was sequenced using two 
full lanes of an Illumina flow cell for a relatively high 
depth of 40 million reads, which suggests that lack of 
sequencing coverage would not bias our results. Overall, 
these findings suggest that low-expression transcripts 
(many of which may have escaped detection with previ- 
ous techniques) constitute an unexplored area of the 
CAD transcriptome that may contain clinically relevant 
information. 

Validation of RNA-Seq results using affymetrix arrays 

To address the effects of variation introduced by experi- 
mental procedures, we repeated the cell culture and 




controls - Iog10 expression 



Figure 2 Plot showing RPKM values of 186 differentially 

expressed transcripts and their absolute expression (measured 

as copies/cell). Log 10 RPKM values for controls and cases are 

shown on the x and y-axes, respectively. The dashed diagonal line 

represents equal RPKM values for both cases and controls, and hence 

no differential expression. Thus, the orthogonal distance to the dashed 

line of a point indicates the amount of differential expression. The blue 

numbers on the dashed line indicate the logl 0 measure of copies/cell 

for the seven spike-in RNA controls (ranging from 1-100000), and the 

position along the x and y-axes indicate the corresponding RPKM 

values recorded for these spikes. Based on a linear regression 

model for copies/cell as a function of RPKM, predictions for absolute 

expression measured as copies/cell for the 186 transcripts were made 

from these spike-in values. For each position, the average of the RPKMs 

corresponding to the x and y-coordinate values yield predictions of cell 

copies based on this model and a corresponding color shown in the 

background of the plot ranging from green (low) to red (high). 
I J 

RNA extraction steps for all 16 cell lines used for the 
RNA-Seq, starting from previously frozen stocks. We 
then measured gene expression in these newer cultures 
using Affymetrix Human Exon 1.0 ST microarrays. Pre- 
vious studies have demonstrated that these two methods 
of expression measurement (RNA-Seq and microarray) 
show strong agreement when applied to the same RNA 
sample [23]. Here, despite the fact that two additional 
passages of the cell lines took place in between the sep- 
arate experiments, comparisons of the sixteen pairs of 
RNA-Seq and microarray results from the same subjects 
still showed high correlation (average R = 0.73, p < 0.001) 
(Additional file 1: Figure SI). Hence, we conclude that 
in vitro procedures did not appear to cause significant 
variability in the gene expression profiles of the cell lines. 

To identify gene expression differences in cases and 
controls that were robust to both experimental (i.e., cell 
culture) replication as well as measurement technique, 
(i.e., RNA-Seq or microarray), we next selected genes 
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that showed consistent up- or downregulation in both 
experiments. From the list of 186 differentially expressed 
transcripts found by RNA-Seq, 161 were included in the 
microarray probe design. Of these, 110 were regulated in 
the same direction (i.e. showed consistent upregulation 
or downregulation) in both the microarray and RNA- 
Seq data (Figure 3A). Given that these two gene expres- 
sion measurements were conducted independently and 
on different RNA extraction batches, the observed 
agreement in fold change for 110/161 genes is highly 
significant (one-tailed Fisher s Exact Test, p < 0.0001; 
Additional file 2: Table S4). Next, we selected a valid- 
ation cohort of eight additional matched casexontrol 
pairs from the extremes of the CAC score distribution 
(Table 1) and analyzed these samples using Affymetrix 
Human Exon 1.0 ST microarrays. Despite the fact that 
cases in this cohort had a lower median calcification 
score compared to the first cohort (682.5 and 1521.5, 
respectively), we again found a significant number of 
transcripts (71 of the above 110) with consistent up- or 
downregulation in both cohorts (one-tailed Fishers Exact 
Test, p = 0.0014; Figure 3B; Additional file 2: Table S5). We 
conclude that these transcripts may be potential CAC 
biomarkers, as their expression differences persist even 
though the extent of calcification differed substantially 
when comparing our first and second cohorts. 

Finally, we reasoned that direct evidence of a causal rela- 
tionship with CAD would be obtained if any transcripts 



from LCL-based analyses exhibited differential expression 
patterns within actual atherosclerotic lesions. To identify 
such genes, we downloaded publicly available microarray 
data (GEO accession number GSE28829) from a gene 
expression study of human carotid artery autopsy sam- 
ples [24]. From these samples (which were part of the 
Maastricht Pathology Tissue Collection), RNA was col- 
lected from 13 arteries in early CAD states (pathological 
intimal thickening and intimal xanthoma) and from 16 ad- 
vanced CAD lesions (thin or thick fibrous cap atheroma). 
The two groups of samples, which were clearly separated 
by Principal Components Analysis, showed an abundance 
of genes with highly significant differential expression (4039 
genes with p < 0.05, one-way ANOVA test with 10% 
Benjamini-Hochberg FDR correction, Additional file 1: 
Figure S2). However, our aim here was not to study 
these samples themselves but to identify genes showing 
similar patterns in LCLs and atherosclerotic arteries. To 
identify such genes, we calculated fold changes of 20639 
genes that were measured in both studies (i.e. RNA-Seq 
in our project and the Affymetrix U133 Plus 2.0 micro- 
arrays used for the autopsy RNA study). Next, we se- 
lected genes that had a fold change > 1.25 in the same 
direction in all three data sets (i.e. first and second sets 
of CAC subjects in our study and in the CAD autopsy 
samples). Two genes (MMP7 and IGLL5) met this 
criterion, of which, MMP7 is especially notable due to 
its functional role in atherosclerotic plaque rupture 





Log 10 fold change (microarray) 



-2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 

Log 10 fold change (batch 1) 



*• Microarray 

RNA-Secj*^^^ 


Downregulated 


Upregulated 


Total 




'^■^-^CAC Batch 2 
CAC Batchl^' , ^^ > ^ 


Downregulated 


Upregulated 


Total 


Downregulated 


69 


29 


98 




Downregulated 


41 


28 


69 


Upregulated 


22 


41 


63 




Upregulated 


11 


30 


41 


Total 


91 


70 


161 




Total 


52 


58 


110 



Two-tailed Fisher's Exact Test p<.0001 Two-tailed Fisher's Exact Test p=0.0014 

.1 .2 .3 .4 .5 6 7 8 .9 Quantile Density Contours 

Figure 3 Validation of RNA-Seq results with microarray data. (A) Fold changes (case/control) from first 16 subjects, measured by RNA-Seq 
(Y-axis) and microarrays (X-axis) in logarithm base 10 scale. Red dots show 1 10 transcripts (out of 186 in Figure 1) that were upregulated or 
downregulated in both experiments. (B) Comparison of microarray data from first and second groups of 16 subjects (X- and Y-axes, respectively). 
Red dots show 71 transcripts out of 1 10 in panel A that were upregulated or downregulated in both groups. Contingency tables for statistical 
calculations are shown below each panel. 
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and vascular smooth muscle cell apoptosis [25,26]. In 
addition, at the protein level, plasma concentrations of 
MMP7 were increased in patients with coronary artery 
disease [27]. Regarding IGLL5, this gene is interesting, 
as apart from being upregulated with disease in two 
unrelated tissues, it also had the largest fold change in 
the list of 186 significant transcripts. To investigate if 
IGLL5 is successfully translated in the arterial wall and 
if the mRNA difference persists at the protein level, we 
conducted preliminary immunoblotting experiments 
on lysates from human carotid plaques. Using an anti- 
human polyclonal antibody (which we confirmed to be 
IGLL5-specific using siRNA knockdown experiments; 
Additional file 1: Figure S3A-B), we compared three 
samples each from areas exhibiting early and advanced 
CAD (diffuse intimal thickening and fibroatheroma, 
respectively). Corresponding to the mRNA upregula- 
tion, we found a trend of increased protein levels in 
lysates from advanced disease (Figure 4). IGLL5 was 
also detected at abundant levels in the supernatants 
of the above lysates, suggesting that it is a secreted 
protein (Additional file 1: Figure S3C-D). Although the 
trend of protein upregulation was not statistically sig- 
nificant in this small set of samples (Additional file 2: 
Table S6), given the complexity of CAD etiology, follow- 
up studies of this gene using a larger sample size may be 
worthwhile. 



Analyses of transcription at unannotated genomic 
regions 

Recent evidence from RNA-Seq studies has demon- 
strated that the human genome is pervasively tran- 
scribed [28]. Thus, it is possible that CAD-relevant 
expression differences may exist in unannotated parts 
of the genome. Such differences would be even more 
interesting, as previous gene expression studies of CAD 
using microarrays would have entirely missed these re- 
gions due to their absence in the probe design. To explore 
this aspect of the transcriptome, we used the Cufflinks 
suite of RNA-Seq analysis tools (see Methods) [14]. First, 
we assembled novel transcripts, which we defined as those 
not included in a combination of several previous annota- 
tions including predicted genes. Next, we used the Cuffdiff 
tool to measure differential expression in these novel 
transcripts. Finally, we manually inspected the results to 
confirm that they showed evidence for being putative 
transcripts. Using this approach, we found eight novel 
loci that show differential expression with high statis- 
tical significance (p < 0.0001 after Benjamini-Hochberg 
correction) (Table 2). In terms of transcript structure, 
five of these eight loci contain multiple regions of in- 
creased read coverage joined by spliced RNA-Seq reads 
that mapped following the GT-AG rule (Figure 5). This 
strongly suggested the presence of a canonical exon- 
intron structure. Mapping these loci to the ENCODE 



Diffuse Intimal Thickening 



Fibroatheroma 



A IGLL5 

(23kDa) 



B ACTB 
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Figure 4 Immunoblot demonstrating the expression of IGLL5 in carotid plaques obtained at endarterectomy. Lanes 1-3: lysates from an 
area exhibiting diffuse intimal thickening (DIT). Lanes 4-6: matched lysates of fibroatheromatous plaque from the same samples. Panel A: anti-human 
IGLL5 antibody staining. Panel B: Beta-actin band used for densitometric quantification of IGLL5. Panel C: Increased trends of IGLL5 expression in 
fibroatheromatous plaques. See also Additional file 1: Figure S3 for siRNA knockdown experiments of to establish specificity of this antibody for IGLL5. 
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Table 2 Novel transcripts detected by Cuff links/Cuff merge/Cuff Diff pipeline 



Name 


Location 


Fold change (case/control) 


Corrected p value 


Spliced? 


Histone mark? 


EST? 


r\ ice i i Q/iQ 
LUrr. 1 1 DHy 


LI II O. I U I JZH/ Z\J- I U I JOOJO/ 


— 1 .OO 1 OJ 


U.U^f 1 DOJ 


Y tfS 


Y es 


t es 


CUFF.2472 


chrl 2:31 838565-31 853772 


-2.26342 


0.123977 


Yes 


Yes 


Yes 


CUFF.3 


chrl :80691 -80824 


-11.5273 


4.46E-08 


No 


Yes 


Yes 


CUFF.4 


chr1:452495-461228 


-10.6557 


7.54E-10 


Yes 


Yes 


Yes 


CUFF.5* 


chrl :1 63653-1 64729 


-12.5567 


1.96E-10 


No 


No 


Yes 


CUFF.6* 


chr1:1 6481 1-166159 


-12.8975 


1.27E-12 


No 


No 


No 


CUFF.7* 


chrl :1 66235-1 66948 


-12.7567 


1.27E-12 


Yes 


No 


Yes 


CUFF.9 


chr1:227665-257075 


-11.1513 


1.18E-08 


Yes 


Yes 


Yes 



Column 5 denotes presence of spliced reads following GT-AG rule. Column 6 denotes presence of layered H3K4Me1 or H3K4Me3 histone modification marks at 
the locus. Column 7 denotes the presence of spliced or unspliced human ESTs in the UCSC Genome Browser database. *CUFF.5, CUFF.6 and CUFF.7 were annotated by 
the Cufflinks pipeline as separate transcripts but their close location suggests that they may be components of the same transcript. 



Integrated Regulation track on the UCSC Genome 
Browser showed histone modifications indicating active 
transcription. Although functional analyses will be re- 
quired to confirm if these transcripts have genuine rele- 
vance to CAD, these results are interesting as they form 
the first description of unannotated regions of the tran- 
scriptome associated with this disease. As the ability to 
discover such novel transcripts is a significant advan- 
tage of RNA-Seq over existing techniques, we speculate 
that future analyses in different tissues and with larger 



sample sizes will continue to demonstrate that hidden 
components of CAD biology may exist outside known 
genes. 

Detection of candidate genes for alternative splicing 

The role of alternative splicing in CAD overall (or CAC 
in particular) has not been explored comprehensively, 
mainly due to the difficulties of microarray platforms 
in efficiently detecting changes in splicing patterns. In 
contrast, RNA-Seq is extremely powerful for studying 
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Scale 
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Layered H3K4Mel 



UCSC Genes Based on RefSeq. UniProt. GenBank. CCDS and Comparative Genomics 
RefSeq Genes 
AceView Gene Models With Alt-Splicing 
Consensus CDS 
Mammalian Gene Collection Full ORF mRNAs 
Vega Protein-Coding Annotations 
Vega Annotated Pseudogenes and Immunoglobulin Segments 
Ensembl Gene Predictions - archive Ensembl 54 - may2009 
■ 

Swiss Institute of Btomformatics Gene Predictions from mRNA and ESTs 
Genscan Gene Predictions 
Exoniphy Human/Mouse/Rat/Dog 
ENCODE Enhancer- and Promoter-Associated Histone Mark (H3K4Me1) on 8 Cell Lines 



Figure 5 Putative novel transcript on chromosome 12 detected by Cufflinks. UCSC Genome Browser tracks in vertical order: (A) 
Chromosomal location. (B) WIG file showing expression levels. (C) BAM file showing mapped RNA-Seq reads (blue bars for forward strand; red for 
reverse). Segments joined by thin lines represent reads spanning a putative splice junction. (D) Assorted gene and gene prediction tracks showing 
absence of prior annotations for the novel transcript. (E) ENCODE enhancer- and promoter-associated histone mark H3K4Me1. 
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splicing variation. As statistical algorithms for such ana- 
lyses are in an early stage of development, we chose 
to proceed with a conservative approach for detecting 
casexontrol splicing changes. Specifically, we analyzed 
exon-level RNA-Seq read counts using a linear ANOVA 
model fit to exon-level RNA-Seq read counts (see 
Methods). This model was designed to detect genes 
containing exons with increased casexontrol differen- 
tial expression relative to other exons within the same 
transcript , which would thus be potential candidates 
for exon skipping, the most common form of alternative 
splicing. Using this approach, we detected 45 transcripts 
that show statistically significant differential usage in at 
least one exon (pAC < 0.05 after 10% FDR correction) 
(Figure 6). These transcripts were then validated inde- 
pendently using the cuffdiff computational algorithm 
[14]. Eleven transcripts were validated by both methods 
and hence show promise for follow-up studies (Additional 
file 2: Tables S7 and S8). Interestingly, this list included 
the human leukocyte antigen gene HLA-H, variation in 
which has been associated with CAD susceptibility [29]. 

Differentially expressed genes are enriched for CAD 
association and disease-relevant functions 

We next determined if differentially expressed genes 
showed an overall relationship with cardiovascular dis- 
ease. To investigate this, we first combined previously 



compiled lists of CAD-associated genes [30,31]. From the 
combined list, we identified 3424 genes which showed 
expression in our RNA-Seq data. From this list, 43 genes 
are present in the list of 186 differentially expressed tran- 
scripts in our results. Statistically, this represented a highly 
significant enrichment of CAD-associated genes (one- 
tailed Fisher s Exact Test, p = 0.0002; Additional file 1: 
Figure S4), suggesting that gene expression differences 
we detected by comparing cell lines from CAC cases 
and controls showed an overall relationship with CAD. 

At the network level, differentially expressed genes 
may identify biological pathways that are perturbed in 
CAD. To investigate this hypothesis, we used Ingenuity 
Pathways Analysis™ (IPA) software, which performs sys- 
tems biology analysis using a large repository of previ- 
ously documented gene-gene interactions and functional 
annotations (Additional file 3). The highest-scoring net- 
work (IPA score = 48) in the list of 186 differentially 
expressed transcripts included the term "Cardiovascular 
System Development and Function", further highlighting 
the relationship of differential expression in this study 
with CAD. Interestingly, the second highest-scoring 
network (IPA score = 36) contained the function "Lipid 
Metabolism", suggesting another possible mechanism 
through which gene expression differences may be related 
to CAD etiology. Finally, Gene Ontology (GO) term ana- 
lysis of the 186 transcripts (using only LCL-expressed 
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Figure 6 Examples of alternative splicing detected using exon-level analysis of RNA-Seq counts in PIGQ (top panel) and VIM (bottom 
panel). X axis shows genomic location of exons (red and black dots for cases and controls, respectively) in hg 1 8 coordinates. Y axis shows 
mean-subtracted normalized RNA-Seq read counts for each exon. Arrows at bottom right show direction of transcription. Neither gene was 
differentially expressed when all exons were considered together. 
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genes as background to avoid bias), detected a number of 
networks related to cyclic AMP metabolism (Additional 
file 2: Table S9). 

Discussion 

We present here results from a pilot project for the use 
of RNA-Seq and cell lines in studying CAC gene expres- 
sion. We used gene expression microarrays and cell cul- 
ture replication for technical and experimental validation, 
respectively, of our initial results from this new technol- 
ogy. This project should be viewed in context of its scope 
and limitations; analyses of much larger numbers of LCLs 
will be required before any definitive statements can be 
made about the functional role in CAC of genes that we 
detected. However, our primary objective in this study was 
not functional analysis of such genes, but rather to identify 
candidate dysregulated genes in CAC using two relatively 
unexplored tools that may be useful overall in CAD re- 
search. By choosing LCLs from subjects who represent 
extreme outliers for the calcification score distribution, 
we hypothesized that between-group differences that 
are relevant for CAD may be detected. Evidence for this is 
reflected in the significant enrichment of CAD-associated 
genes in our results and further underscored in our IPA 
results, which show that differentially expressed genes we 
detected are concentrated in functional networks directly 
connected to CAD. However, we do not suggest that LCLs 
mirror in any way the transcriptional landscape of the 
atherosclerotic lesion itself. Rather, we hypothesize that 
our results represent eQTL-associated gene expression 
variants, which are stably propagated in these cell lines, 
underlying the observed differences between the case 
and control groups. We show that the combination of 
RNA-Seq and microarrays can be used to obtain useful 
results with a comparatively small sample size. In the 
near future, we anticipate that rapidly falling sequencing 
costs and development of user-friendly data analysis 
tools will soon make RNA-Seq the preferred alternative 
for transcriptome studies. 

As the scope of this pilot project did not include func- 
tional validation assays, we refrain from speculating on 
the specific mechanisms by which individual genes may 
be involved in atherosclerosis development. However, as 
to the best of our knowledge, this study represents the 
first whole-transcriptome analysis specifically focusing 
on coronary artery calcification, and differentially genes 
that we found include many previous CAD candidates 
(such as MMP7, EPAS1, ESAM, CASP1, GUCY1A3, 
CLCN4, LEF1, ENPP5, and ZHX2), some observations 
may be relevant to state here. Interestingly, we detected 
differential expression of ZHX2, a regulator of plasma 
lipid metabolism that is the closest gene to the top SNP 
locus from a recent atherosclerosis GWAS meta-analysis 
[32]. At the gene family level, the GIMAP family is 



enriched, with four genes (GIMAP1, GIMAP4, GIMAP5, 
and GIMAP7). Perhaps not by coincidence, multiple 
GIMAP family members have also been reported in a 
set of 128 genes (the "A-module") that is associated with 
atherosclerosis development. Overall, the GIMAP family 
has a crucial role in the development and function of T 
lymphocytes, which have an important role in the etiology 
of atherosclerosis [33]. Finally, it is intriguing that we de- 
tected differential expression of four genes involved in 
cAMP metabolism (ADM, APLP1, PRKCA and PTHLH), 
as existing studies focusing on individual genes involved 
in arterial calcification collectively suggest that perturb- 
ation of ATP metabolism plays a role in this process [34] . 

Conclusions 

In summary, in this project we piloted the study of cor- 
onary artery calcification using cell lines as a patient sur- 
rogate for gene expression. We demonstrate that with 
careful experimental design and secondary validation of 
results, statistically significant results can be obtained 
with RNA-Seq even from a small sample size. In addition 
to canonical gene expression, we focused on alternative 
splicing and novel transcript discovery, two areas of the 
CAD transcriptome that may benefit the most from fur- 
ther RNA-Seq analyses. Studies using larger numbers of 
LCLs along with follow-up experiments will be needed to 
validate differentially expressed genes from this pilot study 
as true CAC biomarkers; this will become possible in the 
near future as we deploy RNA-Seq to the entire ClinSeq® 
cohort. 
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